-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/PERF: Series.combine_first converting int64 to float64 #51777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/PERF: Series.combine_first converting int64 to float64 #51777
Conversation
If the solution to this issue is to convert floats back to ints when they get converted to floats, that will not work correctly for some ints. For example:
It would be good to include big ints like these in the tests. |
pandas/core/series.py
Outdated
@@ -3272,7 +3274,13 @@ def combine_first(self, other) -> Series: | |||
if this.dtype.kind == "M" and other.dtype.kind != "M": | |||
other = to_datetime(other) | |||
|
|||
return this.where(notna(this), other) | |||
combined = this.where(notna(this), other) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this won't break anything, but for the series case there might be a proper fix. You could set a compatible fill_value for the reindex ops. Only disadvantage is, that you'll have to compute notna(this)
somehow
great catch! thanks. I've updated the PR to handle this and updated the test. This also provides a nice perf improvement:
And I'll note the perf improvement is not specific to the integer case, here is
|
pandas/core/series.py
Outdated
return this.where(notna(this), other) | ||
combined = concat([this, other]).reindex(new_index, copy=False) | ||
combined.name = self.name | ||
return combined |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a finalise call with self, otherwise we will loose metadata.
This might changed result ordering? Can we do a reindex in the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks
pandas/core/series.py
Outdated
this = self | ||
null_mask = isna(this) | ||
if null_mask.any(): | ||
drop = this.index[null_mask].intersection(other.index[notna(other)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a quick comment what this is doing? Took me a while to figure it out, don't want to do this again in 6 months time :)
Otherwise lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i've simplified this logic a bit. should be easier to follow now
It looks like you are using these big integers in the tests to test for loss of precision:
But note that there is no loss in going from int to float to int for these special numbers:
It would be good to use ints in the tests where loss of precision actually could happen, e.g.:
|
updated, thanks @jessestone7 |
Thanks again @lukemanley |
I just tried this on Pandas version 2.2.2 and I see that there is a loss of precision:
|
doc/source/whatsnew/v2.1.0.rst
file if fixing a bug or adding a new feature.